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An Improved Composite Hypothesis Test for Markov Models with 
Applications in Network Anomaly Detection* 

Jing Zhangt and loannis Ch. Paschalidis* 


Abstract —Recent work has proposed the use of a composite 
hypothesis Hoeffding test for statistical anomaly detection. 
Setting an appropriate threshold for the test given a desired 
false alarm probability involves approximating the false alarm 
probability. To that end, a large deviations asymptotic is 
typically used which, however, often results in an inaccurate 
setting of the threshold, especially for relatively small sample 
sizes. This, in turn, results in an anomaly detection test that 
does not control well for false alarms. In this paper, we develop 
a tighter approximation using the Central Limit Theorem 
(CLT) under Markovian assumptions. We apply our result to 
a network anomaly detection application and demonstrate its 
advantages over earlier work. 

Index Terms —Hoeffding test, weak convergence, Markov 
chains, network anomaly detection. 

1. Introduction 

During the last decade, the applications of statistical 
anomaly detection in communication networks using the 
Large Deviations Theory (LDT) [1] have been extensively 
explored; see, e.g., [2], [3], [4], among others. Statistical 
anomaly detection involves characterizing the “normal be¬ 
havior” of the system and identifying the time instances 
corresponding to abnormal system behavior. Assuming that 
the network traffic is stationary in time, [2] applies two 
methods to that end. The first method, termed ''model-free'' 
models traffic as an independent and identically distributed 
(i.i.d.) sequence. The second method, termed "model-based'' 
models traffic as a finite-state Markov chain. These are then 
extended in [4] to the case where the network traffic is 
time-varying. Essentially, each of these methods is designed 
to tackle a certain Universal Hypothesis Testing Problem 
(UHTP). 

A UHTP aims to test the hypothesis that a given sequence 
of observations is drawn from a known Probability Law (PL) 
(i.e., probability distribution) defined on a finite alphabet [5]. 
It is well known that the test proposed by Hoeffding [6] 
is optimal in an error exponent sense [1], [2], [4]. When 
implementing Hoeffding’s test, one must set a threshold rf, 
which can be estimated by using Sanov’s theorem [1], a 
result within the scope of LDT. Note that such an estimate 
(let us denote it by is valid only in the asymptotic sense. 
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In practice, however, only a finite number of observations are 
available, and it can be shown by experiments (e.g., using 
the software package TAHTIID [7]) that is typically not 
accurate enough. 

To improve the accuracy of the estimation for r], [5] (or, 
see [8]) borrows a method typically used by statisticians; 
that is, deriving results based on Weak Convergence (WC) 
of the test statistic to approximate the error probabilities 
of Hoeffding’s test. Under the i.i.d. assumption, by using 
the technique combining WC results together with Taylor’s 
series expansion, [5] (or, see [8], [9]) gives an alternative 
approximation for rj (let us denote it by rj^^). Using [7], 
one can verify that, in the finite sample-size setting, is 
typically much more accurate than 

It is worth pointing out that some researchers have also 
tried to obtain a more accurate approximation for rj by 
refining Sanov’s theorem [10]. However, as noted in [8], 
such refinements of large deviation results are typically hard 
to calculate. 

In this work we establish under a Markovian as¬ 
sumption, thus, extending the work of [5] which considered 
the i.i.d. case. We apply the proposed procedure to network 
anomaly detection by embedding it into the software package 
SADIT [11]. 

The rest of the paper is organized as follows. We first 
formulate the problem in Sec. II and derive the theoretical 
results in Sec. III. We then present the experimental results 
in Sec. IV and finally provide concluding remarks in Sec. V. 

Notation: By convention, all vectors are column vectors. 
To save space, we write x = (vi,..., to denote the column 
vector X. We use prime to denote the transpose of a matrix 
or vector. 

II. Problem Formulation 

Let E = /= 1,...,/V} be a finite alphabet. The ob¬ 

servations Y = {F/; / = 0,1,2,...} form a stochastic process 
taking values from S. 

In the model-based UHTP [1], [2], [4], under the null 
hypothesis the observations Y = {F/; / = 0,1,2,...} are 
drawn according to a Markov chain with state set S. Let its 
transition matrix be Q = {qij)^ 

Let 1|. j denote the indicator function. Define the empirical 
PL by 

= ( 1 ) 

where Z/ = F/), / = Gij = (4', G S x 

S, /,7 = 1,... ,A/, and denote the new alphabet 0 = 



{Oij; ij =1,... ,A^} = {Ok, k=l,... ,N^}. Note that 0 = 
S X S and §1 = On, • • •, 0^ = Oin, ..§(a^_i)a^+i = Om, • • •, 
= Onn- Let also the set of PLs on 0 be ^{(d). 

It is seen that the transformed observations Z = 
{Z/; / = 1,2,...} form a Markov chain evolving on 0. Let its 
transition matrix be P = with stationary distribution 

where Ttij denotes the probability of seeing Otj, and 7ti= Tin, 
Tt^ = TTin, •••, ^(A^- 1 )A ^+1 = •••, %2 = TTnN’ Let 

Pietjl Bki) denote the transition probability from %/ to Bij. 
Then we have 


p{Bij\Bki) = k,lpj= 1,...,A^, 


which enables us to obtain P directly from Q. 

Write 7t = and = 

(r„ • • • ,r„ • • • ,r„ • • • ,r„ (^a^a^)). 

Define the relative entropy (i.e., divergence) 


Z)(r„||;r) = ££r„(%)log--, (2) 


i=\j=\ 




and the empirical measure 


u« = V«(r„-;r). (3) 


For any positive integer n, let ^ be the output of the 
test that decides to accept or to reject the null hypothesis 
based on the first n observations in the sequence Z. Under 
the Markovian assumption, Hoeffding’s test [1] is given by 

^ = ^<^£)(r„||ff)<T7„, (4) 

where D(r„|| ;r), defined in (2), is the test statistic, and t]„ 
is a threshold. Note that, when applied in anomaly detection, 
Hoeffding’s test will report an anomaly if and only if ^ ^ 
or, equivalently, D(r„|| ;r) > 

Under hypothesis the false alarm probability [5] of 
the test sequence {^} as a function of n is defined by 

^n=P^{D{Tn\\K)>nn}, (5) 

where Pjf{A} denotes the probability of event A under hy¬ 

pothesis M'. Sanov’s theorem [1] implies the approximation 
for T]„ 

Vn - --^OgiPn). (6) 

n 

Given G (0,1), our central goal in this paper is to improve 
the accuracy of the approximation for T]„. 


III. Theoretical Results 
We introduce the following condition: 

Condition (C). Z = {Z/; / = 1,2,...} is an aperiodic, ir¬ 
reducible, and positive recurrent Markov chain [12], [13] 
evolving on 0 with stationary distribution It, and with the 
same 7t as its initial distribution. 

Remark 1: Since 0 is a finite set, we know that Z is uni¬ 
formly ergodic [12], [14]. Assuming the initial distribution 


to be 7t is for notational simplicity; our results apply for any 
feasible initial distribution. Note also that, under (C), It must 
have full support over 0, meaning each entry in it is strictly 
positive. 


Lemma III.l Assume (C) holds. Then 


1 


Lh yti, 7t,, 


= qij, = 


(7) 


Proof: Expanding the first N entries of ItF = It, we 
obtain quY^^i / = 1, • • • Summing up both sides 

of these equations, it follows 

(!".«.<) Ll,(*) 

Noticing = 1, we see that (8) implies = 

ZfLi which, together with qu Tlti = TTn, yields 
^11 _ ^11 _ 

Similarly, we can show (7) holds for all the other {ijfs. ■ 
Remark 2: By Remark 1 and Lemma III.l we see that, 
under Condition (C), all entries in Q are strictly positive, 
indicating that any two states of the original chain Y are 
connected. This is a strict condition; yet, in engineering 
practice, if some Hij in (7) are equal to zero, we can always 
replace them with a small positive value, and then normalize 
the modified vector It, thus ensuring that Condition (C) can 
be satisfied. See Sec. IV-A for an illustration of this trick. 


A. Weak convergence of 

Let us first establish CLT results for one-dimensional 
empirical measures 

t/W = x/^(r„ (%)-%), k=l,...,N\ (9) 

Fix k e {!,... ,N^}. Define 


fk{^) — l{z=4} 


1, ifZ=%, 

0, ifZG0\{4}. 


( 10 ) 


Lemma III.2 Assume (C) holds. Then a CLT holds 
for Un\' that is, Un^ > ./L (O, cr?) with a? = 

n^oo \ K / K 

Cov(A(Zi), /^(Zi)) + 2i:“=iCov(A(Zi), fkiZi+m)) < 


00. 


Proof: This can be established by applying [12, Corol¬ 
lary 1]. Noting /^(•) is bounded and the chain Z is uniformly 
ergodic, we see that all the conditions needed by [12, 
Corollary 1] are satisfied. ■ 

Now we can directly extend Lemma IIL2 to the multidi¬ 
mensional case (for an informal argument, see, e.g., [15]; 
cf. [16, Sec. III.6] for the definition of a Gaussian random 
variable with a general normal distribution). Due to its role in 
the following applications, we state the result as a theorem. 


Theorem III.3 Assume (C) holds. Then a multidimensional 
CLT holds for U„; that is, 

u„^^yr(o, A) 

)-oo 


( 11 ) 






with A being an x covariance matrix given by 

A = Ao + ^^^^A^, (12) 

where Aq and A^ are specified, respectively, 
by Aq = [Cov(/;-(Zi), /j(Zi))]^^^j and = 

[Cov(/i(Zi), /,(Zi+^))+Cov(/,(Zi), fi{Z,+„,))]f.^^, 
m= 1,2,.... 

The rest of this subsection is devoted to the computation of 
A; we will express A = [A^'in terms of the quantities 
determining the probabilistic structure of the chain Z. 

In particular, we have 

and 

= Cov(l|2j=e,}’ l{Zi+„=ey}) 

+ Cov(l{2j=§^.}, l{Zi+„= 0 ,})’ 
m= 1,2,...; 1,7 = 1,...,^^. 

We first determine ( 13 ) as follows. By direct calculation, 
we obtain Aq’^^ = iti (1 — iti), Aq’^^ = — 7 t\ 7 t 2 , and so on. In 
general, we have Aq = fti {Itj — ftj) ^ /, 7 = 1,..., where 
\ij denotes the (/,7)-th entry of the identity matrix. 

Now we compute A2 by ( 14 ). Omitting the details, again 
by direct calculation, we arrive at Cov (l{Zi= 6 i,}>l{Z 3 =ej}) - 
— Ttj), where is the (/,7)-th entry of the matrix 
(the square of the transition matrix P). Interchanging the 
indexes i,j, we obtain Cov(l{2^^g,}, 1^23=0,}) 

TT,). Thus, we have = tt, (Pjj — Ttj) + Mj (P?- — . 

Computing in a similar way, in general, we derive 

= fti (p™ _ ftj) + Ttj (P^ -Tti), m = 1,2,..., 

where P^ is the (/,7)-th entry of the matrix (the m-th 
power of the transition matrix P). 

Thus, for ij= 1 ,... ,A^ we finally attain 

Ab’b = Tti (I,-7 - Ttj) + Y 2 =i [^i (P" - %) + % (P™ - ^ 0 ] • 

( 15 ) 

B. Taylor’s series expansion of DiTn ||;r) 

Inspired by [ 5 ] (or, see [8]) wherein the i.i.d. assumption is 
required, we use a Taylor’s series expansion to approximate 
the relative entropy T>(r„ ||^). 

To this end, for V G ^ ( 0 ) let us consider 

Vjj 

/i(v)=D(v||®) = ^^v,7log^^. ( 16 ) 

By direct calculation, we derive 

= log Vij - log J V,,) - log Itij 
+ log(£"j%), 1,7 = 1 ,..., A, ( 17 ) 

which leads to 

Vh{lt)={). ( 18 ) 


Further, from (17), we compute the Hessian V^h{v) by 
considering three cases: 

(i) If 1 , then = 0 . 

(ii) If ^ = / and I = 7 , then 

d^hjv) _ d^hjv) _ 1 1 

dVijdvu~ dvfj ~ Vij 

(iii) If k = i and / j, then 

d'^h{v) _ d'^h{v) _ 1 

dVijdVki ~ dVijdVii ~ l!j=iVij' 

All these terms evaluated at It yield V^/i {it), which will play 
a crucial role in approximating T>(r„ ||;r). 

From the ergodicity of the chain Z, it follows 7C. 

)-oo 

It is seen that both Vh{v) and V^h{v) are continuous in a 
neighborhood of It, and we can use the second-order Taylor’s 
series expansion of h{v) centered at It to approximate 
D{Tn 11^) = h^Tn) —h{n). To be specific, we have 

z) (r„ II ;r )«1 (r„ - it)'v'^h{it) (r„ ~it). ( 19 ) 

C. Threshold approximation 

We use an empirical CDF to approximate the actual CDF 
of D(r„|| ;r). In particular, by (19) we have 

£)(r„ ||;r) « 2 (Vi(r„ - n))'v^h{It) (v^(r„ - It )), 

( 20 ) 

where, by Thm. III.3, ^Jn{Tn — ^ > -2F(0, A) with A 

n^oo 

given by (15). Thus, to derive an empirical CDF, we can 
generate a set of Gaussian sample vectors independently 
according to c/F(0, A) and then plug each such sample 
vector into (20) (i.e., replace ^/n{^n — 7t)) and compute the 
right hand side of ( 20 ), thus, obtaining a set of sample 
scalars, as an approximation for samples of D(r„ ||;r). 

Once we obtain an empirical CDF of D(r„||;r), say, 
denoted then, by (5), we can use it to estimate 

Pn as 

rir^F-J{l-l5„-,n), ( 21 ) 

where n) is the inverse of Fem{'\ n). 

Clearly, to calculate the right hand side of (20), we need 
the parameter 7t. In practice, we typically do not know the 
actual values of the underlying chain parameters, so we need 
to estimate them accordingly. This can be done by computing 
Tn over a long past sample path. 

IV. Experimental Results 
A. Numerical results for threshold approximation 

The experiments in this subsection are performed using 
the software package TAHTMA [17]. For convenience, we 
use 0 = { 1 , 2 ,... ,A^} to indicate the states, and assume Tt 
(ground truth) is the initial distribution. 

In the following numerical examples, we first randomly 
create a valid (i.e., such that (C) holds) N x N transition 
matrix Q, hence an x transition matrix P, and then 





generate T samples of the chain Z, each with length n, 
denoted = {zf, t = l,...,r. These samples 
will be used to derive empirical CDF’s. Also, to estimate the 
parameters of interest, we generate one more training sample 
= {Z^i \-.. }, where no is the length. The stationary 

distribution 7t (estimation) of the chain is computed by taking 
any row of , where mo is a sufficiently large integer. 

With samples Z") = {zfwe can 
compute T samples of the scalar random variable D (F^ ||;r), 
by (2). Thus, an empirical CDF of D(r„||;r), denoted 
F(-; n), can be derived. We will treat F n) as the actual 
CDF of D (r„ ||;r). The threshold given by (21) with F (•; n) 
(i.e., n) plays the role of Fem{'\ ^)) is then taken as the 
theoretical (actual) value r]„. 

Next we derive an estimated empirical CDF of DiVn ||^), 
using Taylor’s series expansion together with Thm. III.3. 

Let us first estimate the parameters of interest. Recall it = 
= 1,...,A) = {7tk\ We estimate itk 

by 

-- l,e\, k=l,...,N\ (22) 

no I 

where e > 0 is a small number. The purpose of introducing 
£ is to avoid division by zero. Then, we obtain an estimated 
It SiS It = k = 1,...,A^), or written It = 

{ftij; ij= 1,...,A), where ftn = nijs, ..., tcin = Tt^/s, 

= f(Ar_i)Ar+i/s, nNN = ^N2/s, with f = ^7 

being the normalization constant. 

We are now ready to calculate the estimate for (it), by 
evaluating (it ). 

Further, by Lemma III.l, we estimate Q by 

iJ=l,...,N. (23) 

Then an estimation of P, denoted P, can be directly derived 
from Q. 

Finally, by (15), we estimate the covariance matrix A using 

= ki ( 1,7 - kj)+ y!Zi [^i m - %) 

+ nj{Pji-ni)], /,7 = (24) 

It is worth noting that, to ensure the estimated A is at least 
symmetric, we update A by (A + A ) / 2. Also, to ensure A to 
be positive semi-definite, we use the trick of QR factorization 
[18]; for details, the reader is referred to [17]. 

We now generate T Gaussian sample vectors according to 
./L(0, A) and then plug each of them into (20) and compute 
the right hand side of (20) with {it) replaced by {it), 
thus, obtaining T scalar samples as approximations for sam¬ 
ples of D{Tn ||;r). Therefore, an estimated empirical CDF of 
D(r;^||;r), denoted Fem{'\ n), can be derived accordingly. 

given by (21) is then the WC approximation for the 
threshold. 

Let the false alarm probability be = j3 = 0.001. We set 
the aforementioned parameters as A = 12, e = 10“^, T = 
1000, mo = 1000, and let no = lOOOA^. In Fig. 1, the red line 


indicates the theoretical value of the threshold r]„, the blue 
line shows the WC approximation and the green line 
demonstrates the estimate obtained by Sanov’s theorem, 
all as a function of the number of samples n. It is seen that 
is much more accurate than 



n (number of samples) 

Fig. 1. Threshold (rf) versus number of samples (n); the number of states 
in the original Markov chain is N = 12. 

Remark 3: In Fig. 1, the red line corresponding to the 
actual value is not as smooth; the reason is that each time 
when varying the number of samples n, we regenerate all the 
samples Z^^^ = ,..., f = 1,..., T, from scratch. On 

the other hand, the blue line corresponding to looks 
smooth; this is because we only need to generate the T 
Gaussian sample vectors according to c/L(0, A) once. Note 
that, in our experiments, most of the running time is spent 
generating the samples Z^^^ = {z|^\... ,Z^^^}, t = l,...,r, 
and then calculating rin- In practice, when implementing 
the approximation procedure proposed in this subsection, 
we will only need to focus on obtaining which is 

computationally inexpensive. 

B. Simulation results for network anomaly detection 

We will use the term traffic and flow interchangeably in 
this subsection; they mean the same thing. The simulations 
are done using the software package SADIT [11], which, 
based on the/^'-simulator [19], can efficiently generate fiow- 
level network traffic datasets with annotated anomalies. 

The network (see Fig. 2) consists of an internal network 
involving 8 normal users {CT1-CT8) and 1 server {SRV) 
that stores some sensitive information, and 3 Internet nodes 
{1NT1-1NT3) that visit the internal network via a gateway 
{GATEWAY). 

When dealing with the data, we use the features “fiow 
duration,” “fiow size,” and “distance to cluster center” (this 
is related to a user IP address). The data preprocessing 
involves three steps: i) network traffic representation and flow 
aggregation, ii) quantization, and iii) window aggregation', 
cf. [4, Sec. III.A]. 

To implement Hoeffding’s test, we first estimate a PL 
(resp., a set of PLs) for the stationary (resp., time-varying) 
normal traffic. For the former case, we use as the reference 
traffic the whole fiow sequence with anomalies at some 












Fig. 2. Simulation setting (this figure is from [4] (or, see [20])). 


time interval that we seek to detect. For the latter case, we 
generate the reference traffic and the traffic with anomalies 
separately, sharing all of the parameters except the ones 
regarding the anomalies. Using the windowing technique 
[20], [4], we then persistently monitor the traffic and report 
anomalies instantly whenever the relative entropy is higher 
than the threshold corresponding to the detection window. 

The key difference between the whole detection proce¬ 
dures for the two types of traffic is that, estimating a PL 
for the stationary traffic is straightforward, while, for the 
time-varying traffic, we need to make an effort to identify 
several different PLs corresponding to certain time intervals. 
We apply the two-steps procedure proposed in [4]; that is, 
with some rough estimation of the shifting and periodic 
parameters of the traffic, we first generate a large set of 
PLs, and then refine the large family of PLs by solving a 
weighted set cover problem. For further theoretical details 
and the implementation, the reader is referred to [4], [11]. 

Note that, to deal with the time-varying traffic, [4] pro¬ 
poses a generalized Hoeffding’s test, which we will use for 
Scenario 2 in the following. However, there is no essential 
difference compared to the typical Hoeffding’s test. Note also 
that, in the following experiments, we only use the model- 
based method [20], [4]. 

1) Scenario 1 (Stationary Network Traffic): We mimic the 
case for anomaly caused by “large access rate;” cf. [20, Sec. 
IV.A.3]. The simulation time is 7000 s. A user suspiciously 
increases its access rate to 10 times of its normal value 
between 4000 s and 4500 s. The interval between the starting 
point of two consecutive time windows is taken as 50 s, the 
window-size is chosen to be = 400 s, and the false alarm 
probability is set as j3 =0.001. 

Set the quantization level for “flow duration,” “flow size,” 
and “distance to cluster center” to be = 1, ^2 = 2, and 
^3 = 2, respectively. Set the number of user clusters as k = 3. 

See Fig. 3 and Fig. 4 for the detection results. Both figures 
depict the relative entropy (divergence) metric in (2). The 
green dashed line in Fig. 3 is the threshold calculated using 
Sanov’s theorem (i.e., r]ff, where n is the number of fiows in 
a specific detection window). The green dashed line in Fig. 
4 is the threshold computed using the WC result derived in 
Sec. Ill (i.e., T]^^). The interval during which the entropy 
curve is above the threshold line (the red part) is the interval 



Fig. 3. Detection result for Scenario 1 with Wg = 400 s, ni = 1, ^2 = 2, 
n 3 = 2, k = 3; threshold is estimated by use of Sanov’s theorem. 



Fig. 4. Detection result for Scenario 1 with = 400 s, ni = 1, ^2 = 2, 

= 2, k = 3; threshold is estimated by use of weak convergence result. 

reported as abnormal. 

Fig. 3 shows that, if using 7]^^ as the threshold, then 
the detection method reports an anomaly for all windows 
including the ones wherein the behavior of the network traffic 
is actually normal; in other words, there are too many false 
alarms. Clearly, this is undesirable. Fig. 4 shows that, if, 
instead, we use as the threshold, then the detection 
method will not report any false alarm while successfully 
identifying the true anomalies between 4000 s and 4500 s. 
The reason for such detection results is that is more 
accurate than 

2) Scenario 2 (Time-Varying Network Traffic): Consider 
the network with day-night pattern where the fiow size 
follows a log-normal distribution. We use exactly the same 
scenario as that in [4, Sec. IV.B.2]. 

Applying the same procedure as in [4, Sec. III.C], we first 
obtain 32 rough PLs. Then, using the heuristic PL refinement 
algorithm given in [4, Sec. III.D] equipped with the up-bound 
nominal cross-entropy parameter A = 0.027565, which is 
determined by the WC approximation, we finally obtain 6 
PLs (see Fig. 5). These PLs are active during morning, 
afternoon, evening, mid-night, dawn, and the transition time 
around sunrise, respectively. In the following detection pro- 

























cedure, the key difference between our method and the one 
used in [4] is that we no longer manually set the threshold 
universally as a constant; instead, the threshold for each 
detection window is automatically calculated by use of the 
WC approximation. We set the quantization level for “flow 
duration,” “flow size,” and “distance to cluster center” to be 

= 1, ^2 = 4, and ^3 = 1, respectively, and set the number 
of user clusters to k = 1. The interval between the starting 
point of two consecutive time windows is chosen as 1000 s, 
the window-size is taken as = 1000 s, and the false alarm 
probability is set as j3 =0.001. 

Noting that the anomaly is simulated beginning at 59 h 
and lasting for 80 minutes [4], we see from Fig. 6 that the 
anomaly is successfully detected, without any false alarms. 



time (h) 

Fig. 5. PLs identification for Scenario 2 with Ws = 1000 s, = 1, ^2 = 4, 
^3 = 1 , ^ = 1 . 



Fig. 6. Detection result for Scenario 2 with = 1000 s, = 1, ^2 = 4, 
^3 = 1 , ^ = 1 . 

V. Conclusions 

We establish a weak convergence result for the relative 
entropy in Hoeffding’s test under a Markovian assumption, 
which enables us to obtain a more accurate approximation 
(compared to the existing approximation given by Sanov’s 
theorem) for the threshold needed by the test. We validate 
the accuracy of such approximation through numerical ex¬ 
periments and simulations for network anomaly detection. 
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